library(tidyverse) # for graphing and data cleaning
library(googlesheets4) # for reading googlesheet data
library(lubridate) # for date manipulation
library(ggthemes) # for even more plotting themes
library(geofacet) # for special faceting with US map layout
gs4_deauth() # To not have to authorize each time you knit.
theme_set(theme_minimal()) # My favorite ggplot() theme :)
#Lisa's garden data
garden_harvest <- read_sheet("https://docs.google.com/spreadsheets/d/1DekSazCzKqPS2jnGhKue7tLxRU3GVL1oxi-4bEM5IWw/edit?usp=sharing") %>%
mutate(date = ymd(date))
# Seeds/plants (and other garden supply) costs
supply_costs <- read_sheet("https://docs.google.com/spreadsheets/d/1dPVHwZgR9BxpigbHLnA0U99TtVHHQtUzNB9UR0wvb7o/edit?usp=sharing",
col_types = "ccccnn")
# Planting dates and locations
plant_date_loc <- read_sheet("https://docs.google.com/spreadsheets/d/11YH0NtXQTncQbUse5wOsTtLSKAiNogjUA21jnX5Pnl4/edit?usp=sharing",
col_types = "cccnDlc")%>%
mutate(date = ymd(date))
# Tidy Tuesday data
kids <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-15/kids.csv')
These exercises will reiterate what you learned in the “Expanding the data wrangling toolkit” tutorial. If you haven’t gone through the tutorial yet, you should do that first.
garden_harvest data to find the total harvest weight in pounds for each vegetable and day of week. Display the results so that the vegetables are rows but the days of the week are columns.
garden_harvest %>%
mutate(week_day = wday(date, label = TRUE)) %>%
group_by(vegetable, week_day) %>%
mutate(wt_lbs = weight*0.00220462) %>%
summarize(daily_wt_lbs = sum(wt_lbs)) %>%
pivot_wider(id_cols = vegetable,
names_from = week_day,
values_from = daily_wt_lbs,
values_fill = 0)
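As a small check of the reshaping step, here is a minimal sketch on made-up numbers (the `long` table below is hypothetical, not Lisa's data). It shows how pivot_wider() turns day-of-week values into columns and how values_fill = 0 fills in days with no harvest.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format summary: one row per vegetable and day
long <- tibble(
  vegetable    = c("beans", "beans", "kale"),
  week_day     = c("Mon", "Tue", "Mon"),
  daily_wt_lbs = c(1.5, 0.5, 2)
)

# One row per vegetable, one column per day; missing combinations become 0
wide <- long %>%
  pivot_wider(id_cols = vegetable,
              names_from = week_day,
              values_from = daily_wt_lbs,
              values_fill = 0)

wide
```

Kale has no Tuesday harvest in this toy table, so its Tue cell is filled with 0 rather than NA.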
garden_harvest data to find the total harvest in pounds for each vegetable variety and then try adding the plot variable from the plant_date_loc table. This will not turn out perfectly. What is the problem? How might you fix it?
garden_summary <- garden_harvest %>%
group_by(vegetable, variety, date) %>%
mutate(wt_lbs = weight*0.00220462) %>%
summarize(daily_wt_lbs = sum(wt_lbs))
garden_summary %>%
left_join(plant_date_loc,
by = c("vegetable", "variety"))
As shown above, certain vegetables and varieties are duplicated after the join. For example, in rows 17 and 18, the beans (Bush Bush Slender variety) harvested on 2020-07-06 are reported as harvested in both plot M and plot D. In reality, these vegetables and varieties were not harvested twice. When Lisa collected her data, she didn’t record the plot she harvested from, so every harvest row is matched to each plot where that variety was planted. This could be fixed if Lisa recorded the plot in which each vegetable and variety was harvested.
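The duplication can be reproduced on a tiny example. This is a minimal sketch with made-up weights (the `harvest` and `planting` tables are hypothetical stand-ins for garden_summary and plant_date_loc): when one variety is planted in two plots, each harvest row matches two planting rows.

```r
library(dplyr)

# Hypothetical harvest records: two harvests of one variety
harvest <- tibble(
  vegetable = c("beans", "beans"),
  variety   = c("Bush Bush Slender", "Bush Bush Slender"),
  date      = as.Date(c("2020-07-06", "2020-07-08")),
  wt_lbs    = c(0.5, 0.3)
)

# Hypothetical planting records: the same variety planted in two plots
planting <- tibble(
  vegetable = c("beans", "beans"),
  variety   = c("Bush Bush Slender", "Bush Bush Slender"),
  plot      = c("M", "D")
)

# Each harvest row matches both planting rows, so 2 rows become 4
# (recent dplyr versions warn about the many-to-many match)
joined <- left_join(harvest, planting, by = c("vegetable", "variety"))

total_before <- sum(harvest$wt_lbs)
total_after  <- sum(joined$wt_lbs)
```

Here total_after is twice total_before, which is why summing weights after this kind of join overstates the harvest.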
garden_harvest and supply_costs datasets, along with data from somewhere like this to answer this question. You can answer this in words, referencing various join functions. You don’t need R code but could provide some if it’s helpful.
We would first need to add a variable to garden_harvest, which we could name variety_price, that gives the store price per pound of each vegetable. We can get it by left joining the outside price data by vegetable, keeping only the price column, and then multiplying that price by the vegetable’s weight in pounds. Next, we could left join garden_harvest and supply_costs by variety. Finally, we could create a variable called money_saved by subtracting the supply cost (price_with_tax) from the store value of the harvest.
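The chain of joins described in this answer could look roughly like the sketch below. Every table and number here is a hypothetical stand-in (including the `grocery_prices` table and its `price_per_lb` column), and for simplicity both joins use vegetable as the key rather than variety.

```r
library(dplyr)

# All three tables are made-up examples, not the real garden data
harvest_lbs <- tibble(vegetable = c("tomatoes", "beans"),
                      total_lbs = c(10, 4))
grocery_prices <- tibble(vegetable = c("tomatoes", "beans"),
                         price_per_lb = c(3, 2))     # assumed store prices
seed_costs <- tibble(vegetable = c("tomatoes", "beans"),
                     price_with_tax = c(5, 3))       # assumed supply costs

money_saved <- harvest_lbs %>%
  left_join(grocery_prices, by = "vegetable") %>%    # attach store price
  left_join(seed_costs, by = "vegetable") %>%        # attach supply cost
  mutate(store_value = total_lbs * price_per_lb,
         money_saved = store_value - price_with_tax)

money_saved
```

With real data, the key columns would need to match exactly across tables (same spelling and capitalization), or the left joins would return NA prices.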
garden_harvest %>%
filter(vegetable %in% c("tomatoes")) %>%
mutate(wt_lbs = weight*0.00220462) %>%
ggplot(aes(y = fct_infreq(variety))) +
geom_bar() +
labs(title = "Tomato Variety",
x = "Count",
y = "Variety")
garden_harvest data, create two new variables: one that makes the varieties lowercase and another that finds the length of the variety name. Arrange the data by vegetable and length of variety name (smallest to largest), with one row for each vegetable variety. HINT: use str_to_lower(), str_length(), and distinct().
garden_harvest %>%
mutate(variety_lower = str_to_lower(variety)) %>%
mutate(variety_length = str_length(variety)) %>%
mutate(variety2 = fct_infreq(variety)) %>%
distinct(vegetable, variety, .keep_all = TRUE) %>%
arrange(vegetable, variety_length)
garden_harvest data, find all distinct vegetable varieties that have “er” or “ar” in their name. HINT: str_detect() with an “or” statement (use the | for “or”) and distinct().
garden_harvest %>%
  # No spaces around |, otherwise the spaces become part of the pattern
  filter(str_detect(variety, "er|ar")) %>%
  distinct(vegetable, variety)
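Spaces inside a regular expression are matched literally, so the spacing around | changes what str_detect() finds. A minimal sketch on a few sample strings (chosen only for illustration) shows the difference between the two patterns.

```r
library(stringr)

# Sample variety names, used only to illustrate the patterns
varieties <- c("Better Boy", "Farmer's Market Blend",
               "Cherokee Purple", "Amish Paste")

# "er | ar" means: "er" followed by a space, OR a space followed by "ar"
spaced <- str_detect(varieties, "er | ar")

# "er|ar" means: "er" anywhere, OR "ar" anywhere
unspaced <- str_detect(varieties, "er|ar")

data.frame(varieties, spaced, unspaced)
```

Only "Better Boy" passes the spaced pattern (because of the "er " before "Boy"), while the unspaced pattern also catches the other two names that contain "er" or "ar".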
In this activity, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.
Two data tables are available:
Trips contains records of individual rentals
Stations gives the locations of the bike rental stations
Here is the code to read in the data. We do this a little differently than usual, which is why it is included here rather than at the top of this file. To avoid repeatedly re-reading the files, start the data import chunk with {r cache = TRUE} rather than the usual {r}.
data_site <-
"https://www.macalester.edu/~dshuman1/data/112/2014-Q4-Trips-History-Data.rds"
Trips <- readRDS(gzcon(url(data_site)))
Stations<-read_csv("http://www.macalester.edu/~dshuman1/data/112/DC-Stations.csv")
NOTE: The Trips data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. When you have this working well, you should access the full data set of more than 600,000 events by removing -Small from the name of the data_site.
It’s natural to expect that bikes are rented more at some times of day, some days of the week, some months of the year than others. The variable sdate gives the time (including the date) that the rental started. Make the following plots and interpret them:
sdate. Use geom_density().
Trips %>%
ggplot(aes(x = sdate)) +
geom_density() +
labs(title = "Event Versus Date",
x = "Date",
y = "Density")
In the density plot above, we observe that bikes are rented more often in October and November than in December. This can plausibly be explained by the transition from fall weather into winter weather.
mutate() with lubridate’s hour() and minute() functions to extract the hour of the day and minute within the hour from sdate. Hint: A minute is 1/60 of an hour, so create a variable where 3:30 is 3.5 and 3:45 is 3.75.
Trips %>%
mutate(time_of_day = hour(sdate) + (minute(sdate)/60)) %>%
ggplot(aes(x = time_of_day)) +
geom_density() +
labs(title = "Event Versus Time of Day",
x = "Time of Day",
y = "Density")
In the density plot above, we observe the time of the day that bikes are rented out. As shown, bikes are rented out more often during two specific time periods of the day: early morning between 7am and 8am and in the afternoon between 5pm and 6pm. This can be explained by the fact that people use bikes during this time period to go to work in the morning and to go home in the afternoon. It’s the typical rush hour for public transportation.
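The decimal-hour variable behind this plot can be checked on a few hand-picked timestamps. This is a minimal sketch with made-up times, using the same hour() plus minute()/60 formula as the code above.

```r
library(lubridate)

# Made-up timestamps for checking the conversion
times <- ymd_hms(c("2014-10-15 03:30:00",
                   "2014-10-15 03:45:00",
                   "2014-10-15 17:15:00"))

# A minute is 1/60 of an hour, so 3:30 becomes 3.5 and 3:45 becomes 3.75
time_of_day <- hour(times) + minute(times) / 60

time_of_day
```

Seconds are ignored here, which is fine at the resolution of a density plot over a full day.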
Trips %>%
mutate(days_of_week = wday(sdate, label = TRUE)) %>%
ggplot(aes(y = days_of_week)) +
geom_bar() +
labs(title = "Event Versus Day of the Week",
x = "Count",
y = "Day of the Week")
As shown in the bar graph above,
Trips %>%
mutate(time_of_day = hour(sdate) + (minute(sdate)/60)) %>%
mutate(days_of_week = wday(sdate, label = TRUE)) %>%
ggplot(aes(x = time_of_day)) +
geom_density() +
facet_wrap(~days_of_week) +
labs(title = "Event Versus Time and Day of the Week",
x = "Time of Day",
y = "Density")
The variable client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (Casual). The next set of exercises investigates whether these two different categories of users show different rental behavior and how client interacts with the patterns you found in the previous exercises. Repeat the graphic from Exercise @ref(exr:exr-temp) (d) with the following changes:
fill aesthetic for geom_density() to the client variable. You should also set alpha = .5 for transparency and color=NA to suppress the outline of the density function.
Trips %>%
mutate(time_of_day = hour(sdate) + (minute(sdate)/60)) %>%
mutate(days_of_week = wday(sdate, label = TRUE)) %>%
ggplot(aes(x = time_of_day, fill = client)) +
# color and alpha are fixed settings, so they belong in the geom, outside aes()
geom_density(color = NA, alpha = 0.5) +
facet_wrap(~days_of_week) +
labs(title = "Event Versus Time and Day of the Week",
x = "Time of Day",
y = "Density")
position = position_stack() to geom_density(). In your opinion, is this better or worse in terms of telling a story? What are the advantages/disadvantages of each?
Trips %>%
mutate(time_of_day = hour(sdate) + (minute(sdate)/60)) %>%
mutate(days_of_week = wday(sdate, label = TRUE)) %>%
ggplot(aes(x = time_of_day, fill = client)) +
# color and alpha are fixed settings, so they belong in the geom, outside aes()
geom_density(position = position_stack(), color = NA, alpha = 0.5) +
facet_wrap(~days_of_week) +
labs(title = "Event Versus Time and Day of the Week",
x = "Time of Day",
y = "Density")
weekend which will be “weekend” if the day is Saturday or Sunday and “weekday” otherwise (HINT: use the ifelse() function and the wday() function from lubridate). Then, update the graph from the previous problem by faceting on the new weekend variable.
Trips %>%
mutate(time_of_day = hour(sdate) + (minute(sdate)/60)) %>%
mutate(days_of_week = wday(sdate, label = TRUE)) %>%
mutate(weekend = ifelse(days_of_week %in% c("Sat", "Sun"), "Weekend", "Weekday")) %>%
ggplot(aes(x = time_of_day, fill = client)) +
# color and alpha are fixed settings, so they belong in the geom, outside aes()
geom_density(position = position_stack(), color = NA, alpha = 0.5) +
facet_wrap(~weekend) +
labs(title = "Different Client Usage",
x = "Time of Day",
y = "Density")
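The weekend classification can be tested on a few known dates. The sketch below is one possible variant, not the code used above: it compares day numbers instead of the labels "Sat" and "Sun", which avoids depending on the language of the day labels. The dates are hand-picked (2014-10-04 was a Saturday).

```r
library(lubridate)

# Friday, Saturday, Sunday, Monday in October 2014
dates <- ymd(c("2014-10-03", "2014-10-04", "2014-10-05", "2014-10-06"))

# With week_start = 1, Monday is 1, so Saturday and Sunday are 6 and 7
weekend <- ifelse(wday(dates, week_start = 1) >= 6, "weekend", "weekday")

weekend
```

Either approach gives the same grouping when the session locale is English. The numeric version is just a little more robust when the code is run elsewhere.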
client and fill with weekday. What information does this graph tell you that the previous didn’t? Is one graph better than the other?
Trips %>%
mutate(time_of_day = hour(sdate) + (minute(sdate)/60)) %>%
mutate(days_of_week = wday(sdate, label = TRUE)) %>%
mutate(weekend = ifelse(days_of_week %in% c("Sat", "Sun"), "Weekend", "Weekday")) %>%
ggplot(aes(x = time_of_day, fill = days_of_week)) +
# color and alpha are fixed settings, so they belong in the geom, outside aes()
geom_density(position = position_stack(), color = NA, alpha = 0.5) +
facet_wrap(~client) +
labs(title = "Different Client Usage",
x = "Time of Day",
y = "Density")
Stations to make a visualization of the total number of departures from each station in the Trips data. Use either color or size to show the variation in number of departures. We will improve this plot next week when we learn about maps!
# Count departures per station first, then attach each station's coordinates
Trips %>%
  count(sstation, name = "total_departures") %>%
  left_join(Stations,
            by = c("sstation" = "name")) %>%
  ggplot(aes(x = long, y = lat, color = total_departures)) +
  geom_point() +
  labs(title = "Total Departures from Each Rental Location",
       x = "Longitude",
       y = "Latitude")
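One way to get a single row per station before plotting is to count departures first and then join on the coordinates. A minimal sketch with made-up stations (the `trips` and `stations` tables and their coordinates are hypothetical) shows the shape of the result.

```r
library(dplyr)

# Hypothetical trips: two departures from A, one from B
trips <- tibble(sstation = c("A", "A", "B"))

# Hypothetical station locations; station C has no departures
stations <- tibble(name = c("A", "B", "C"),
                   lat  = c(38.90, 38.91, 38.92),
                   long = c(-77.03, -77.04, -77.05))

# One row per departure station, with its count and coordinates
departures <- trips %>%
  count(sstation, name = "total_departures") %>%
  left_join(stations, by = c("sstation" = "name"))

departures
```

Starting from the trip counts means stations with no departures (C here) drop out. Starting from the station table instead would keep them, but their counts would need to be set to 0 explicitly.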
# Share of casual riders among departures from each station, then add coordinates
Trips %>%
  group_by(sstation) %>%
  summarize(percent_casual = mean(client == "Casual")) %>%
  left_join(Stations,
            by = c("sstation" = "name")) %>%
  ggplot(aes(x = long, y = lat, color = percent_casual)) +
  geom_point() +
  labs(title = "Proportion of Departures by Casual Riders at Each Station",
       x = "Longitude",
       y = "Latitude")
as_date(sdate) converts sdate from date-time format to date format.
Top_Trips <- Trips %>%
mutate(trip_date = as_date(sdate)) %>%
group_by(sstation, trip_date) %>%
count() %>%
arrange(desc(n)) %>%
head(10)
Top_Trips
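The station-date counting step can be checked on a toy table. This is a minimal sketch with made-up trips (the `trips` table is hypothetical): as_date() drops the time of day, so two trips on the same day from the same station fall into one group.

```r
library(dplyr)
library(lubridate)

# Hypothetical trips: station A has two departures on 2014-10-15
trips <- tibble(
  sstation = c("A", "A", "A", "B"),
  sdate    = ymd_hms(c("2014-10-15 08:58:00", "2014-10-15 17:05:00",
                       "2014-10-16 09:00:00", "2014-10-15 12:00:00"))
)

# Count departures per station and calendar date, largest count first
top <- trips %>%
  mutate(trip_date = as_date(sdate)) %>%
  count(sstation, trip_date, sort = TRUE) %>%
  head(1)

top
```

count() with sort = TRUE is shorthand for grouping, counting, and arranging in descending order, so it does the same job as the longer pipeline above.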
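# Match on both station and date so only trips from the top station-date pairs are kept
Top_Trips %>%
  left_join(Trips %>% mutate(trip_date = as_date(sdate)),
            by = c("sstation", "trip_date"))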
Top_Trips %>%
left_join(Trips %>% mutate(trip_date = as_date(sdate)),
by = c("sstation", "trip_date")) %>%
mutate(day_of_week = wday(sdate, label = TRUE)) %>%
group_by(day_of_week, client)
DID YOU REMEMBER TO GO BACK AND CHANGE THIS SET OF EXERCISES TO THE LARGER DATASET? IF NOT, DO THAT NOW.
This problem uses the data from the Tidy Tuesday competition this week, kids. If you need to refresh your memory on the data, read about it here.
facet_geo(). The graphic won’t load below since it came from a location on my computer. So, you’ll have to reference the original html on the moodle page to see it.
DID YOU REMEMBER TO UNCOMMENT THE OPTIONS AT THE TOP?